Clustering-based Phonetic Projection in Mismatched Crowdsourcing Channels for Low-resourced ASR
Abstract
Acquiring labeled speech for low-resource languages is a difficult task in the absence of native speakers of the language. One solution to this problem involves collecting speech transcriptions from crowd workers who are foreign or non-native speakers of a given target language. From these mismatched transcriptions, one can derive probabilistic phone transcriptions that are defined over the set of all target language phones using a noisy channel model. This paper extends prior work on deriving probabilistic transcriptions (PTs) from mismatched transcriptions by 1) modelling multilingual channels and 2) introducing a clustering-based phonetic mapping technique to improve the quality of PTs. Mismatched crowdsourcing for multilingual channels has certain properties of projection mapping, e.g., it can be interpreted as a clustering based on singular value decomposition of the segment alignments. To this end, we explore the use of distinctive feature weights, lexical tone confusions, and a two-step clustering algorithm to learn projections of phoneme segments from mismatched multilingual transcriber languages to the target language. We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers. We observe a 5–9% relative reduction in phone error rate for the predicted Cantonese phone transcriptions using our proposed techniques compared with the previous PT method.
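The abstract's remark that the projection "can be interpreted as a clustering based on singular value decomposition of the segment alignments" can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy alignment counts, the symbol inventories, the helper names `svd_embed` and `kmeans`, and the use of plain k-means on the SVD embedding. It is not the paper's two-step algorithm and does not model distinctive feature weights or lexical tone confusions.

```python
import numpy as np

# Hypothetical alignment counts: rows are annotator-language symbols,
# columns are target-language phones. The numbers are invented.
annotator_symbols = ["p", "b", "s", "z"]
target_phones = ["p", "ph", "s"]
counts = np.array([
    [8.0, 2.0, 0.0],
    [7.0, 3.0, 0.0],
    [0.0, 0.0, 10.0],
    [0.0, 1.0, 9.0],
])


def svd_embed(counts, k):
    """Embed each annotator symbol in the top-k singular subspace."""
    # Row-normalise so each row is an empirical P(target phone | annotator symbol).
    probs = counts / counts.sum(axis=1, keepdims=True)
    u, s, _ = np.linalg.svd(probs, full_matrices=False)
    return u[:, :k] * s[:k]


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


embedding = svd_embed(counts, k=2)
labels = kmeans(embedding, k=2)
for symbol, label in zip(annotator_symbols, labels):
    print(symbol, "-> cluster", int(label))
```

In this toy setting, annotator symbols that are aligned to similar target phones end up close together in the embedding, so they fall into the same cluster. A real system would estimate the counts from aligned mismatched transcriptions rather than hard-code them.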
Similar resources
Mismatched Crowdsourcing from Multiple Annotator Languages for Recognizing Zero-Resourced Languages: A Nullspace Clustering Approach
It is extremely challenging to create training labels for building acoustic models of zero-resourced languages, in which conventional resources required for model training – lexicons, transcribed audio, or in extreme cases even orthographic system or a viable phone set design for the language – are unavailable. Here, language mismatched transcripts, in which audio is transcribed in the orthogra...
Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages
When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning first labels the speech using ASR from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, ...
Improving Under-Resourced Language ASR Through Latent Subword Unit Space Discovery
Development of state-of-the-art automatic speech recognition (ASR) systems requires acoustic resources (i.e., transcribed speech) as well as lexical resources (i.e., phonetic lexicons). It has been shown that acoustic and lexical resource constraints can be overcome by first training an acoustic model that captures acoustic-to-multilingual phone relationships on language-independent data; and th...
Transcribing continuous speech using mismatched crowdsourcing
Mismatched crowdsourcing derives speech transcriptions using crowd workers unfamiliar with the language being spoken. This approach has been demonstrated for isolated word transcription tasks, but never yet for continuous speech. In this work, we demonstrate mismatched crowdsourcing of continuous speech with a word error rate of under 45% in a large-vocabulary transcription task of short speech...
Mismatched Crowdsourcing: Mining Latent Skills to Acquire Speech Transcriptions
Automatic speech recognition (ASR) converts audio to text. ASR is usually trained using a large quantity of labeled data, i.e., audio with text transcription. In many languages, however, text transcription is hard to find, e.g., in both Hokkien and Dinka, we found native speakers who had received all their primary education in some other language, and who therefore had difficulty writing in the...